n-best hypothesis
Game-Oriented ASR Error Correction via RAG-Enhanced LLM
Jiang, Yan, Luo, Yongle, Zhou, Qixian, Liu, Elvis S.
With the rise of multiplayer online games, real-time voice communication is essential for team coordination. However, general ASR systems struggle with gaming-specific challenges like short phrases, rapid speech, jargon, and noise, leading to frequent errors. To address this, we propose the GO-AEC framework, which integrates large language models, Retrieval-Augmented Generation (RAG), and a data augmentation strategy using LLMs and TTS. GO-AEC includes data augmentation, N-best hypothesis-based correction, and a dynamic game knowledge base. Experiments show GO-AEC reduces character error rate by 6.22% and sentence error rate by 29.71%, significantly improving ASR accuracy in gaming scenarios.
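The retrieval step of such a RAG-enhanced pipeline can be sketched as follows. The knowledge base entries, similarity threshold, and prompt wording are illustrative assumptions, not GO-AEC's actual implementation:

```python
# Minimal sketch: retrieve game-specific terms that are lexically close
# to the ASR n-best hypotheses, then build an LLM correction prompt.
from difflib import SequenceMatcher

GAME_KB = ["smoke grenade", "flashbang", "rush B", "eco round"]  # toy knowledge base

def retrieve_terms(hypotheses, kb, threshold=0.6):
    """Return KB entries whose surface form is similar to any hypothesis."""
    hits = []
    for entry in kb:
        for hyp in hypotheses:
            if SequenceMatcher(None, entry.lower(), hyp.lower()).ratio() >= threshold:
                hits.append(entry)
                break
    return hits

def build_prompt(hypotheses, terms):
    lines = [f"{i + 1}. {h}" for i, h in enumerate(hypotheses)]
    return ("Correct the ASR transcript using these game terms: "
            + ", ".join(terms) + "\nN-best hypotheses:\n" + "\n".join(lines))

nbest = ["rash bee now", "rush bee now", "rush B now"]
terms = retrieve_terms(nbest, GAME_KB)
prompt = build_prompt(nbest, terms)
```

A real system would replace the string-similarity retriever with phonetic or embedding-based retrieval over a dynamic knowledge base, as the abstract describes.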
LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context
Yamashita, Natsuo, Yamamoto, Masaaki, Kokubo, Hiroaki, Kawaguchi, Yohei
Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach to improve automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches rely primarily on textual information and neglect phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data containing rare words for fine-tuning the GER model. Second, we integrate the ASR's N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the WER and CER across both English and Japanese datasets.
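One way to attach phonetic context to an N-best correction prompt is sketched below, using classic Soundex as a stand-in phonetic code; the paper's actual phonetic representation and prompt format are not specified here, so both are assumptions:

```python
def soundex(word):
    """Classic Soundex: first letter plus three digits encoding consonant classes."""
    codes = {}
    for group, d in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = d
    word = word.lower()
    out = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h/w do not separate duplicate codes
        d = codes.get(ch, "")
        if d and d != prev:
            out += d
        prev = d
    return (out + "000")[:4]

def prompt_with_phonetics(nbest):
    """Annotate each hypothesis with per-word phonetic codes for the LLM."""
    lines = [f"{h}  [phonetic: {' '.join(soundex(w) for w in h.split())}]"
             for h in nbest]
    return ("Pick or correct the transcript; avoid changing words whose "
            "phonetics already match:\n" + "\n".join(lines))

p = prompt_with_phonetics(["call Rupert now", "call Robert now"])
```

Because "Rupert" and "Robert" share the code R163, the annotation signals the LLM that both hypotheses are phonetically plausible, discouraging it from rewriting the name wholesale.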
Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models
Rudovic, Ognjen, Dighe, Pranay, Su, Yi, Garg, Vineet, Dharur, Sameer, Niu, Xiaochuan, Abdelaziz, Ahmed H., Adya, Saurabh, Tewfik, Ahmed
Follow-up conversations with virtual assistants (VAs) enable a user to interact seamlessly with a VA without repeatedly invoking it with a keyword (after the first query). Accurate Device-directed Speech Detection (DDSD) on these follow-up queries is therefore critical for a naturalistic user experience. To this end, we explore the use of Large Language Models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), either by prompting a pretrained LLM or by adapting a binary classifier on top of the LLM. In doing so, we also exploit ASR uncertainty when designing the LLM prompts. On a real-world dataset of follow-up conversations, we show that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) thanks to the joint modeling of the previous speech context and ASR uncertainty, compared to modeling the follow-ups alone.
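A prompt that jointly encodes the first query and the ASR uncertainty of the follow-up might look like the sketch below; the wording and the confidence format are illustrative assumptions, not the paper's exact prompt design:

```python
# Sketch: build a DDSD prompt from the first (device-directed) query and
# the n-best hypotheses of the follow-up, each with an ASR confidence.
def ddsd_prompt(first_query, followup_nbest):
    hyp_lines = [f'- "{text}" (confidence {conf:.2f})'
                 for text, conf in followup_nbest]
    return (
        f'First query to the assistant: "{first_query}"\n'
        "ASR n-best hypotheses for the follow-up utterance:\n"
        + "\n".join(hyp_lines)
        + "\nIs the follow-up directed at the assistant? Answer yes or no."
    )

prompt = ddsd_prompt(
    "set a timer for ten minutes",
    [("and remind me at noon", 0.81), ("under mine at noon", 0.12)],
)
```

Listing several hypotheses with confidences lets the LLM discount low-confidence decodings (often background speech) instead of trusting a single 1-best string.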
ProGRes: Prompted Generative Rescoring on ASR n-Best
Tur, Ada Defne, Moumen, Adel, Ravanelli, Mirco
Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process. However, the best way to exploit recent generative instruction-tuned LLMs for hypothesis rescoring is still unclear. This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated by appropriately prompted LLMs. Specifically, we introduce a new zero-shot method for ASR n-best rescoring that combines confidence scores, LLM sequence scoring, and prompt-based hypothesis generation. We compare Llama-3-Instruct, GPT-3.5 Turbo, and GPT-4 Turbo as prompt-based generators, with Llama-3 as the sequence-scoring LLM. We evaluate our approach with different speech recognizers and observe significant relative improvements in word error rate (WER), ranging from 5% to 25%.
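The score-combination step of such a rescorer can be sketched as a simple log-linear interpolation; the LLM log-probabilities below are hard-coded stand-ins for a real sequence scorer such as Llama-3, and the interpolation weight is an assumption:

```python
import math

# Sketch: rescore an n-best list by interpolating the ASR confidence
# (in log space) with an LLM sequence log-probability.
def rescore(nbest, llm_logprob, weight=0.5):
    """nbest: list of (text, asr_confidence in (0, 1]). Returns the best text."""
    scored = [
        (weight * math.log(conf) + (1 - weight) * llm_logprob[text], text)
        for text, conf in nbest
    ]
    return max(scored)[1]

# Stub LLM scores: the fluent hypothesis gets a much higher log-probability.
llm_logprob = {"recognize speech": -4.0, "wreck a nice beach": -9.0}
best = rescore(
    [("wreck a nice beach", 0.55), ("recognize speech", 0.45)],
    llm_logprob,
)
```

Even though the ASR system ranks the garbled hypothesis first, the LLM score flips the decision; ProGRes additionally injects LLM-generated hypotheses into the list before this step.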
Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models
Tang, Zhiyuan, Wang, Dong, Huang, Shen, Shang, Shidong
Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper redirects the attention to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypothesis-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which covers a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation using the dataset for both direct prompting and fine-tuning of pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which involves transcribing the text hypotheses directly into Pinyin. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs compared with prompts without regularization. The dataset is available on the website.
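The idea can be sketched as follows; the tiny character-to-Pinyin table and the prompt wording are illustrative assumptions (a real system would use a full Pinyin dictionary, e.g. a library like pypinyin), not the paper's exact setup:

```python
# Sketch: append the Pinyin reading of an ASR hypothesis to the prompt,
# so the LLM can prefer homophone-preserving corrections.
PINYIN = {"你": "ni3", "好": "hao3", "号": "hao4", "世": "shi4", "界": "jie4"}

def pinyinize(text):
    """Map each character to its Pinyin; pass through unknown characters."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)

def prompt_with_pinyin(hypothesis):
    return (f"ASR hypothesis: {hypothesis}\n"
            f"Pinyin: {pinyinize(hypothesis)}\n"
            "Correct the hypothesis, keeping the Pinyin consistent.")

p = prompt_with_pinyin("你号世界")  # 号 (hao4) is a homophone error for 好 (hao3)
```

Because Chinese ASR errors are overwhelmingly homophone substitutions, exposing the Pinyin steers the LLM toward same-sound characters rather than arbitrary rewrites.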
Transformer-based Model for ASR N-Best Rescoring and Rewriting
Kang, Iwen E., Van Gysel, Christophe, Siu, Man-Hung
Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer-based model capable of rescoring and rewriting by exploring the full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that works well for both rescoring and rewriting tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system on its own.
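One common discriminative sequence objective for n-best rescoring is minimum expected WER (MWER-style): the expected per-hypothesis error under a softmax over model scores. Whether this matches the paper's exact objective is an assumption; the sketch below only illustrates the general shape of such losses:

```python
import math

def word_errors(hyp, ref):
    """Levenshtein distance over words (edit count, not yet normalized)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (hw != rw)))  # substitution
        prev = cur
    return prev[-1]

def expected_wer_loss(nbest_scores, hyps, ref):
    """Expectation of edit count under softmax(scores); lower is better."""
    exps = [math.exp(s) for s in nbest_scores]
    z = sum(exps)
    return sum((e / z) * word_errors(h, ref) for e, h in zip(exps, hyps))

loss = expected_wer_loss([2.0, 0.0],
                         ["play some music", "play sum music"],
                         "play some music")
```

Minimizing this loss pushes probability mass toward low-error hypotheses across the whole n-best list, rather than only fitting the single reference as cross-entropy would.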
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Ghosh, Sreyan, Kumar, Sonal, Seth, Ashish, Chiniya, Purva, Tyagi, Utkarsh, Duraiswami, Ramani, Manocha, Dinesh
Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam search, further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset of hypothesis-transcription pairs additionally equipped with lip motion cues, to promote further research in this space.